CUDA: Fix bug in topk-moe for gpt-oss #16821
Conversation
When using ggml_can_fuse_subgraph, the output nodes which are passed are wrong. This causes `test-backend-ops` to still fuse nodes (because the nodes are not used elsewhere in the graph), but it actually doesn't fuse in the actual gpt-oss model.
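A minimal sketch of the invariant such a fusion check relies on, using hypothetical types and a hypothetical helper name rather than the real ggml_can_fuse_subgraph signature: every node that is not declared as an output must only be consumed inside the candidate subgraph.

```cpp
// Illustrative sketch only (hypothetical types/names, not the actual ggml API).
// A node that is NOT declared as an output must not be read outside the
// subgraph, otherwise fusing would hide a tensor the rest of the graph needs.
#include <unordered_set>
#include <vector>

struct Node {
    std::vector<const Node *> consumers;  // nodes that read this node's result
};

// 'subgraph' holds the candidate nodes to fuse; 'outputs' are the nodes whose
// results are supposed to remain visible after fusion.
static bool can_fuse_subgraph(const std::vector<const Node *> & subgraph,
                              const std::unordered_set<const Node *> & outputs) {
    std::unordered_set<const Node *> inside(subgraph.begin(), subgraph.end());
    for (const Node * node : subgraph) {
        if (outputs.count(node)) {
            continue;  // declared outputs may be consumed anywhere
        }
        for (const Node * consumer : node->consumers) {
            if (!inside.count(consumer)) {
                return false;  // an intermediate result leaks out of the fused region
            }
        }
    }
    return true;
}
```

If the wrong nodes are declared as outputs, a check like this can still succeed in an isolated test graph (as in `test-backend-ops`, where nothing outside the subgraph reads the intermediates), while in the full gpt-oss graph the tensor that is actually read downstream is missing from the declared output set, so the fusion is silently skipped.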
Did a quick bench on the DGX Spark: `GGML_CUDA=ON ./scripts/compare-commits.sh master pr/16821 llama-bench -m ./models/gpt-oss-20b/ggml-model-mxfp4.gguf -m models/gpt-oss-120b/ggml-model-mxfp4-00001-of-00003.gguf -fa 1 -d 0,4096,8192 -p 2048 -n 32 -ub 2048 -mmp 0`

Taking into account the uncertainty of the numbers, it's hard to say there is a significant difference. I think your earlier observation that fusion is less effective at lower memory bandwidths might be correct.
@ggerganov this PR would only add 4-5%; what I meant was that this along with #16715 would be around 10% in TG. Edit: looks like it doesn't help at all in TG, so yes, it could be the memory bandwidth thing.
Yes, here is a comparison of one commit right before #16715 (build: 3cfa9c3 (6840)) against this PR (build: 91cd070 (6867)). Pretty much no difference on the DGX Spark.
In principle, if you have more memory bandwidth, then the percentage of the runtime going towards loading weights is comparatively low, while the percentage going towards overhead and other I/O is high. So in that regard it makes sense that fusion would be more beneficial for high-memory-bandwidth GPUs. If you look at the numbers I posted in #16715 (comment), there were some AMD GPUs with comparatively low memory bandwidth that benefited from the fusion, but that could also be because the overhead (in my experience) is higher with AMD.
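A toy calculation (all numbers hypothetical, not taken from the benchmarks above) makes the ratio argument concrete: per-token time is roughly weight bytes divided by bandwidth plus a fixed overhead term, and fusion can only reclaim part of that overhead term.

```cpp
// Hypothetical numbers, only to illustrate the bandwidth argument above:
// token_time ≈ weight_bytes / bandwidth + fixed_overhead, so the overhead
// share that fusion could reclaim shrinks as bandwidth drops.
#include <cstdio>

int main() {
    const double weight_gb    = 10.0;             // assumed weight bytes read per token (GB)
    const double overhead_ms  = 1.0;              // assumed fixed per-token overhead (ms)
    const double bandwidths[] = { 300.0, 1500.0 }; // GB/s: a low- vs a high-bandwidth GPU

    for (double bw : bandwidths) {
        const double weights_ms = weight_gb / bw * 1000.0;
        const double total_ms   = weights_ms + overhead_ms;
        std::printf("bw = %6.0f GB/s: weights %.1f ms/token, overhead share %.0f%%\n",
                    bw, weights_ms, 100.0 * overhead_ms / total_ms);
    }
    return 0;
}
```

With the same fixed overhead, the share of runtime that fusion can address grows from a few percent on the low-bandwidth GPU to over ten percent on the high-bandwidth one, which matches the pattern discussed above.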
Found this bug while looking at a related issue. When using ggml_can_fuse_subgraph, the output nodes which are passed are wrong. This causes `test-backend-ops` to still fuse nodes (because the nodes are not used elsewhere in the graph), but it actually doesn't fuse in the actual gpt-oss model.

With this change
Current master
Edit: Looks like the bug was in qwen3 as well after adding the clamp (#16702)
With this change
Current master